HR Analytics Employee Attrition & Performance

BCon 147: special topics

Author

Rogelio D. Cerezo III

Published

October 21, 2024

1 Project Overview

In this project, we will explore employee attrition and performance using the HR Analytics Employee Attrition & Performance data set. The primary goal is to develop insights into the factors that contribute to employee attrition. By analyzing a range of factors, including demographic data, job satisfaction, work-life balance, and job role, we aim to help businesses identify key areas where they can improve employee retention.

2 Scenario

Imagine you are working as a data analyst for a mid-sized company that is experiencing high employee turnover, especially among high-performing employees. The company has been facing increased costs related to hiring and training new employees, and management is concerned about the negative impact on productivity and morale. The human resources (HR) team has collected historical employee data and now looks to you for actionable insights. They want to understand why employees are leaving and how to retain talent effectively.

Your task is to analyze the dataset and provide insights that will help HR prioritize retention strategies. These strategies could include interventions like revising compensation policies, improving job satisfaction, or focusing on work-life balance initiatives. The success of your analysis could lead to significant cost savings for the company and an increase in employee engagement and performance.

3 Understanding data source

The data set used for this project provides information about employee demographics, performance metrics, and various satisfaction ratings. The data set is particularly useful for exploring how factors such as job satisfaction, work-life balance, and training opportunities influence employee performance and attrition.

This data set is well-suited for conducting in-depth analysis of employee performance and retention, enabling us to build predictive models that identify the key drivers of employee attrition. Additionally, we can assess the impact of various organizational factors, such as training and work-life balance, on both performance and retention outcomes.

## the datatable function from the DT package creates an HTML widget that displays the dataset
## install the DT package if it is not yet available in your R environment

library(readxl)

dataset_variable_description <- readxl::read_excel("dataset/dataset-variable-description.xlsx") 

DT::datatable(dataset_variable_description)

4 Data wrangling and management

Libraries

Task: Load the necessary libraries

Before we start working on the data set, we need to load the libraries that will be used for data wrangling, analysis, and visualization. Make sure to load the following libraries here. To install a package, use the install.packages function. Some packages are introduced later in this project, so install them as needed and load them here.

# load all your libraries here

library(dplyr)
library(readr)
library(DT)
library(janitor)
library(GGally)
library(sjPlot)
library(ggplot2)
library(ggstatsplot)
library(report)
library(scales)

4.1 Data importation

Task 4.1. Merging dataset
  • Import the two datasets, Employee.csv and PerformanceRating.csv. Save Employee.csv as employee_dta and PerformanceRating.csv as perf_rating_dta.

  • Merge the two datasets using the left_join function from dplyr. Use the EmployeeID variable as the variable to join by. You may read more about the left_join function in the dplyr documentation.

  • Save the merged dataset as hr_perf_dta and display the dataset using the datatable function from DT package.
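A left join keeps every row of the left table and repeats it once per matching row on the right. The sketch below illustrates this on toy data. It uses base R's merge() with all.x = TRUE as a stand-in for left_join so that it runs without extra packages. The toy tables and their values are hypothetical, and whether the real files have a one-to-many relationship is an assumption you should check on your own data.

```r
# Toy stand-ins for Employee.csv and PerformanceRating.csv (hypothetical values)
employee_toy <- data.frame(
  EmployeeID = c("E1", "E2", "E3"),
  Attrition  = c("No", "Yes", "No")
)
rating_toy <- data.frame(
  EmployeeID    = c("E1", "E1", "E2"),
  ManagerRating = c(4, 3, 2)
)

# Base R counterpart of a left join: keep every row of the left table
merged_toy <- merge(employee_toy, rating_toy, by = "EmployeeID", all.x = TRUE)
print(merged_toy)

# E1 appears twice (two reviews); E3 gets NA because it has no review
n_rows_merged   <- nrow(merged_toy)

# na.omit() then drops the employee without a review
n_rows_complete <- nrow(na.omit(merged_toy))
```

Comparing the row counts before and after the join, and before and after na.omit(), is a quick way to see whether the merged data is at the employee level or the review level.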

## import the two data here

employee_dta <- read_csv("dataset/Employee.csv")
perf_rating_dta <- read_csv("dataset/PerformanceRating.csv")

## merge employee_dta and perf_rating_dta using left_join function.
## save the merged dataset as hr_perf_dta

hr_perf_dta <- left_join(employee_dta, perf_rating_dta, by = "EmployeeID") |> 
  na.omit()



## Use the datatable from DT package to display the merged dataset

datatable(hr_perf_dta)

4.2 Data management

Task 4.2. Standardizing variable names
  • Using the clean_names function from janitor package, standardize the variable names by using the recommended naming of variables.

  • Save the renamed variables as hr_perf_dta to update the dataset.

## clean names using the janitor packages and save as hr_perf_dta

hr_perf_dta <- clean_names(hr_perf_dta)

## display the renamed hr_perf_dta using datatable function

datatable(hr_perf_dta)
Task 4.2. Recode data entries
  • Create a new variable cat_education wherein education is 1 = No formal education; 2 = High school; 3 = Bachelor; 4 = Masters; 5 = Doctorate. Use the case_when function to accomplish this task.

  • Similarly, create new variables cat_envi_sat, cat_job_sat, and cat_relation_sat for environment_satisfaction, job_satisfaction, and relationship_satisfaction, respectively. Re-code the values accordingly as 1 = Very dissatisfied; 2 = Dissatisfied; 3 = Neutral; 4 = Satisfied; and 5 = Very satisfied.

  • Create new variables cat_work_life_balance, cat_self_rating, cat_manager_rating for work_life_balance, self_rating, and manager_rating, respectively. Re-code accordingly as 1 = Unacceptable; 2 = Needs improvement; 3 = Meets expectation; 4 = Exceeds expectation; and 5 = Above and beyond.

  • Create a new variable bi_attrition by transforming the attrition variable into a numeric variable. Re-code accordingly as No = 0 and Yes = 1.

  • Save all the changes in the hr_perf_dta. Note that saving the changes with the same name will update the dataset with the new variables created.
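The case_when approach returns character vectors, which tables and plot axes order alphabetically by default. As an optional alternative, the sketch below recodes the same 1 to 5 scale with base R's factor(), which keeps the labels in their scale order. The education vector here is a hypothetical toy input, not the project data.

```r
# Hypothetical toy scores on the 1-5 education scale
education <- c(3, 1, 5, 2, 4, 3)

# Labels taken from the task description, in scale order
education_labels <- c("No formal education", "High school", "Bachelor",
                      "Masters", "Doctorate")

# An ordered factor keeps the labels in scale order instead of alphabetical order
cat_education <- factor(education, levels = 1:5,
                        labels = education_labels, ordered = TRUE)

# Counts are reported in scale order
print(table(cat_education))
```

The same pattern would apply to the satisfaction and rating scales by swapping in their label vectors.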

## create cat_education, cat_envi_sat, cat_job_sat, cat_relation_sat, cat_work_life_balance, cat_self_rating, cat_manager_rating, and bi_attrition

hr_perf_dta <- hr_perf_dta |> 
  mutate(
    cat_education = case_when(
      education == 1 ~ "No formal education",
      education == 2 ~ "High school",
      education == 3 ~ "Bachelor",
      education == 4 ~ "Masters",
      education == 5 ~ "Doctorate"
    ),
    cat_envi_sat = case_when(
      environment_satisfaction == 1 ~ "Very dissatisfied",
      environment_satisfaction == 2 ~ "Dissatisfied",
      environment_satisfaction == 3 ~ "Neutral",
      environment_satisfaction == 4 ~ "Satisfied",
      environment_satisfaction == 5 ~ "Very satisfied"
    ),
    cat_job_sat = case_when(
      job_satisfaction == 1 ~ "Very dissatisfied",
      job_satisfaction == 2 ~ "Dissatisfied",
      job_satisfaction == 3 ~ "Neutral",
      job_satisfaction == 4 ~ "Satisfied",
      job_satisfaction == 5 ~ "Very satisfied"
    ),
    cat_relation_sat = case_when(
      relationship_satisfaction == 1 ~ "Very dissatisfied",
      relationship_satisfaction == 2 ~ "Dissatisfied",
      relationship_satisfaction == 3 ~ "Neutral",
      relationship_satisfaction == 4 ~ "Satisfied",
      relationship_satisfaction == 5 ~ "Very satisfied"
    ),
    cat_work_life_balance = case_when(
      work_life_balance == 1 ~ "Unacceptable",
      work_life_balance == 2 ~ "Needs improvement",
      work_life_balance == 3 ~ "Meets expectation",
      work_life_balance == 4 ~ "Exceeds expectation",
      work_life_balance == 5 ~ "Above and beyond"
    ),
    cat_self_rating = case_when(
      self_rating == 1 ~ "Unacceptable",
      self_rating == 2 ~ "Needs improvement",
      self_rating == 3 ~ "Meets expectation",
      self_rating == 4 ~ "Exceeds expectation",
      self_rating == 5 ~ "Above and beyond"
    ),
    cat_manager_rating = case_when(
      manager_rating == 1 ~ "Unacceptable",
      manager_rating == 2 ~ "Needs improvement",
      manager_rating == 3 ~ "Meets expectation",
      manager_rating == 4 ~ "Exceeds expectation",
      manager_rating == 5 ~ "Above and beyond"
    ),
    bi_attrition = ifelse(attrition == "Yes", 1, 0)
  )


## print the updated hr_perf_dta using datatable function

datatable(hr_perf_dta)

5 Exploratory data analysis

5.1 Descriptive statistics of employee attrition

Task 5.1. Breakdown of attrition by key variables
  • Select the variables attrition, job_role, department, age, salary, job_satisfaction, and work_life_balance. Save as attrition_key_var_dta.

  • Compute and plot the attrition rate across job_role, department, age, salary, job_satisfaction, and work_life_balance. To compute the attrition rate for job role, for example, group the dataset by job_role, count the number of employees who left in each group, and divide that count by the total number of observations in the group. Save the computation as pct_attrition. Do not forget to ungroup before storing the output. Store the output as attrition_rate_job_role.

  • Plot for the attrition rate across job_role has been done for you! Study each line of code. You have the freedom to customize your plot accordingly. Show your creativity!
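To make the attrition-rate arithmetic concrete, here is a minimal base R sketch on hypothetical toy data. The rate for a group is the share of rows in that group where attrition is "Yes", expressed as a percentage.

```r
# Hypothetical toy data: two job roles, five employees
toy <- data.frame(
  job_role  = c("Recruiter", "Recruiter", "Manager", "Manager", "Manager"),
  attrition = c("Yes", "No", "No", "No", "Yes")
)

# TRUE for employees who left
left <- toy$attrition == "Yes"

# The mean of a logical vector is the proportion of TRUE values,
# so the per-group mean times 100 is the attrition rate in percent
pct_attrition <- round(tapply(left, toy$job_role, mean) * 100, 2)
print(pct_attrition)
```

Here one of two recruiters left (50%) and one of three managers left (33.33%), which is the same quantity as attrition_count / total * 100 in the grouped summary.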

Select Attrition Key Variables

# selecting attrition key variables and save as `attrition_key_var_dta`

attrition_key_var_dta <- hr_perf_dta |> 
  select(attrition, job_role, department, age, salary, job_satisfaction, work_life_balance)

Attrition Rate by Job Role

# compute the attrition rate across job_role and save as attrition_rate_job_role

attrition_rate_job_role <- attrition_key_var_dta |> 
  group_by(job_role) |> 
  summarise(
    total = n(),
    attrition_count = sum(attrition == "Yes"),
    pct_attrition = round(attrition_count / total * 100, 2)
  ) |> 
  ungroup()

# print attrition_rate_job_role

print(attrition_rate_job_role)
# A tibble: 13 × 4
   job_role                  total attrition_count pct_attrition
   <chr>                     <int>           <int>         <dbl>
 1 Analytics Manager           208              28         13.5 
 2 Data Scientist             1360             597         43.9 
 3 Engineering Manager         302              18          5.96
 4 HR Business Partner          23               0          0   
 5 HR Executive                115              29         25.2 
 6 HR Manager                   16               0          0   
 7 Machine Learning Engineer   558              95         17.0 
 8 Manager                     140              19         13.6 
 9 Recruiter                   149              86         57.7 
10 Sales Executive            1521             543         35.7 
11 Sales Representative        489             317         64.8 
12 Senior Software Engineer    494              84         17   
13 Software Engineer          1334             445         33.4 
datatable(attrition_rate_job_role)
# plot the attrition rate by job role

ggplot(attrition_rate_job_role, aes(x = reorder(job_role, -pct_attrition), y = pct_attrition)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_minimal() +
  labs(title = "Attrition Rate by Job Role", x = "Job Role", y = "Attrition Rate (%)") +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  
    plot.title = element_text(hjust = 0.5)  
  ) +
  scale_y_continuous(labels = scales::percent_format(scale = 1))

Attrition Rate by Department

# compute the attrition rate across department and save as attrition_rate_dept

attrition_rate_dept <- attrition_key_var_dta |>
  group_by(department) |>
  summarise(
    total = n(),
    attrition_count = sum(attrition == "Yes"),
    pct_attrition = round(attrition_count / total * 100, 2)
  ) |>
  ungroup()

# print attrition_rate_dept

print(attrition_rate_dept)
# A tibble: 3 × 4
  department      total attrition_count pct_attrition
  <chr>           <int>           <int>         <dbl>
1 Human Resources   303             115          38.0
2 Sales            2149             879          40.9
3 Technology       4257            1267          29.8
datatable(attrition_rate_dept)
# plot the attrition rate by department

ggplot(attrition_rate_dept, aes(x = reorder(department, -pct_attrition), y = pct_attrition)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_minimal() +
  labs(title = "Attrition Rate by Department", x = "Department", y = "Attrition Rate (%)") +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5)
  ) +
  scale_y_continuous(labels = scales::percent_format(scale = 1))

Attrition Rate by Age Groups

# first create age groups

attrition_key_var_dta <- attrition_key_var_dta |>
  mutate(age_group = case_when(
    age < 25 ~ "18-24",
    age >= 25 & age < 35 ~ "25-34",
    age >= 35 & age < 45 ~ "35-44",
    age >= 45 & age < 55 ~ "45-54",
    TRUE ~ "55+"
  ))

# compute the attrition rate across age and save as attrition_rate_age

attrition_rate_age <- attrition_key_var_dta |>
  group_by(age_group) |>
  summarise(
    total = n(),
    attrition_count = sum(attrition == "Yes"),
    pct_attrition = round(attrition_count / total * 100, 2)
  ) |>
  ungroup()

# print attrition_rate_age

print(attrition_rate_age)
# A tibble: 4 × 4
  age_group total attrition_count pct_attrition
  <chr>     <int>           <int>         <dbl>
1 18-24      1511             880         58.2 
2 25-34      3175            1164         36.7 
3 35-44      1505             169         11.2 
4 45-54       518              48          9.27
datatable(attrition_rate_age)
# plot the attrition rate by age groups

ggplot(attrition_rate_age, aes(x = age_group, y = pct_attrition)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_minimal() +
  labs(title = "Attrition Rate by Age Group", x = "Age Group", y = "Attrition Rate (%)") +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5)
  ) +
  scale_y_continuous(labels = scales::percent_format(scale = 1))

Attrition Rate by Salary Ranges

# first create salary ranges

attrition_key_var_dta <- attrition_key_var_dta |>
  mutate(salary_range = case_when(
    salary < 25000 ~ "< 25,000",
    salary >= 25000 & salary < 50000 ~ "25,000-50,000",
    salary >= 50000 & salary < 75000 ~ "50,000-75,000",
    salary >= 75000 & salary < 100000 ~ "75,000-100,000",
    salary >= 100000 & salary < 125000 ~ "100,000-125,000",
    salary >= 125000 & salary < 150000 ~ "125,000-150,000",
    salary >= 150000 & salary < 175000 ~ "150,000-175,000",
    salary >= 175000 & salary < 200000 ~ "175,000-200,000",    
    TRUE ~ "200,000+"
  ))
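As an optional alternative to the chained case_when conditions, base R's cut() can bin salaries in one call and returns a factor whose levels stay in numeric order. The sketch below is illustrative only: the salary values are hypothetical, and right = FALSE is chosen so that each bin includes its lower bound and excludes its upper bound, matching the >= and < conditions above.

```r
# Hypothetical toy salaries
salary <- c(12000, 25000, 49999, 80000, 150000, 210000)

# Ten break points define nine bins
breaks <- c(-Inf, 25000, 50000, 75000, 100000, 125000,
            150000, 175000, 200000, Inf)
range_labels <- c("< 25,000", "25,000-50,000", "50,000-75,000",
                  "75,000-100,000", "100,000-125,000", "125,000-150,000",
                  "150,000-175,000", "175,000-200,000", "200,000+")

# right = FALSE gives intervals of the form [lower, upper)
salary_range <- cut(salary, breaks = breaks, labels = range_labels,
                    right = FALSE)
print(data.frame(salary, salary_range))
```

Because the result is a factor, grouped summaries list the ranges from lowest to highest rather than alphabetically.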

# compute the attrition rate across salary and save as attrition_rate_salary

attrition_rate_salary <- attrition_key_var_dta |>
  group_by(salary_range) |>
  summarise(
    total = n(),
    attrition_count = sum(attrition == "Yes"),
    pct_attrition = round(attrition_count / total * 100, 2)
  ) |>
  ungroup()

# print attrition_rate_salary

print(attrition_rate_salary)
# A tibble: 9 × 4
  salary_range    total attrition_count pct_attrition
  <chr>           <int>           <int>         <dbl>
1 100,000-125,000   569              99          17.4
2 125,000-150,000   432             109          25.2
3 150,000-175,000   217              38          17.5
4 175,000-200,000   250              49          19.6
5 200,000+         1025             213          20.8
6 25,000-50,000    1805             909          50.4
7 50,000-75,000    1265             404          31.9
8 75,000-100,000    885             238          26.9
9 < 25,000          261             202          77.4
datatable(attrition_rate_salary)
# plot attrition rate by salary ranges

ggplot(attrition_rate_salary, aes(x = reorder(salary_range, -pct_attrition), y = pct_attrition)) +  
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_minimal() +
  labs(title = "Attrition Rate by Salary Range", x = "Salary Range", y = "Attrition Rate (%)") +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5)
  ) +
  scale_y_continuous(labels = scales::percent_format(scale = 1))

Attrition Rate by Job Satisfaction

# attrition rate by job satisfaction

attrition_rate_jobsat <- attrition_key_var_dta |>
  group_by(job_satisfaction) |>
  summarise(
    total = n(),
    attrition_count = sum(attrition == "Yes"),
    pct_attrition = round(attrition_count / total * 100, 2)
  ) |>
  ungroup()

# print attrition_rate_jobsat

print(attrition_rate_jobsat)
# A tibble: 5 × 4
  job_satisfaction total attrition_count pct_attrition
             <dbl> <int>           <int>         <dbl>
1                1   130              36          27.7
2                2  1674             549          32.8
3                3  1651             568          34.4
4                4  1685             573          34.0
5                5  1569             535          34.1
datatable(attrition_rate_jobsat)
# plot attrition rate by job satisfaction

ggplot(attrition_rate_jobsat, aes(x = job_satisfaction, y = pct_attrition)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_minimal() +
  labs(title = "Attrition Rate by Job Satisfaction", x = "Job Satisfaction Level", y = "Attrition Rate (%)") +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5)
  ) +
  scale_y_continuous(labels = scales::percent_format(scale = 1))

Attrition Rate by Work Life Balance

# compute the attrition rate across work_life_balance and save as attrition_rate_wlb

attrition_rate_wlb <- attrition_key_var_dta |>
  group_by(work_life_balance) |>
  summarise(
    total = n(),
    attrition_count = sum(attrition == "Yes"),
    pct_attrition = round(attrition_count / total * 100, 2)
  ) |>
  ungroup()

# print attrition_rate_wlb

print(attrition_rate_wlb)
# A tibble: 5 × 4
  work_life_balance total attrition_count pct_attrition
              <dbl> <int>           <int>         <dbl>
1                 1   121              37          30.6
2                 2  1702             568          33.4
3                 3  1670             580          34.7
4                 4  1706             560          32.8
5                 5  1510             516          34.2
datatable(attrition_rate_wlb)
# plot attrition rate by work-life balance

ggplot(attrition_rate_wlb, aes(x = work_life_balance, y = pct_attrition)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_minimal() +
  labs(title = "Attrition Rate by Work-Life Balance", x = "Work-Life Balance Level", y = "Attrition Rate (%)") +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5)
  ) +
  scale_y_continuous(labels = scales::percent_format(scale = 1))

5.2 Identifying attrition key drivers using correlation analysis

Task 5.2. Conduct a correlation analysis to identify key drivers
  • Conduct a correlation analysis of key variables: bi_attrition, salary, years_at_company, job_satisfaction, manager_rating, and work_life_balance. Use the cor() function to run the correlation analysis. Remove missing values using the na.omit() before running the correlation analysis. Save the output in hr_corr.

  • Use a correlation matrix or heatmap to visualize the relationship between these variables and attrition. You can use the GGally package and use the ggcorr function to visualize the correlation heatmap. You may explore this site for more information: ggcorr.

  • Discuss which factors seem most correlated with attrition and what that suggests about why employees are leaving.

## conduct correlation of key variables. 

hr_corr <- hr_perf_dta |> 
  select(bi_attrition, salary, years_at_company, job_satisfaction, 
         manager_rating, work_life_balance) |> 
  na.omit() |> 
  cor()

## print hr_corr 

print(hr_corr)
                  bi_attrition       salary years_at_company job_satisfaction
bi_attrition       1.000000000 -0.211181478    -0.6896527798     0.0132368129
salary            -0.211181478  1.000000000     0.2206442116     0.0053054850
years_at_company  -0.689652780  0.220644212     1.0000000000     0.0008700583
job_satisfaction   0.013236813  0.005305485     0.0008700583     1.0000000000
manager_rating    -0.007654429 -0.001596736     0.0178656879    -0.0158205481
work_life_balance  0.003428836 -0.001517145     0.0079339508     0.0417242942
                  manager_rating work_life_balance
bi_attrition        -0.007654429       0.003428836
salary              -0.001596736      -0.001517145
years_at_company     0.017865688       0.007933951
job_satisfaction    -0.015820548       0.041724294
manager_rating       1.000000000       0.007996938
work_life_balance    0.007996938       1.000000000
datatable(hr_corr)
## install GGally package and use ggcorr function to visualize the correlation

ggcorr(data = NULL,
      cor_matrix = hr_corr,  # hr_corr is already a correlation matrix, so pass it via cor_matrix rather than data
      label = TRUE,
      label_size = 3,
      label_round = 2,
      label_color = "black",
      name = "Correlation",
      legend.position = "right",
      low = "#4477AA",  # Dark blue for negative correlations
      high = "#EE6677",  # Red for positive correlations
      mid = "white"     # White for neutral correlations
      ) +
  theme(
    title = element_text(size = 12, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = 0.5)
  ) +
  labs(
    title = "Correlation Analysis of Key Drivers for Employee Attrition"
  )

Discussion:

The heatmap colors each correlation by sign and strength: white marks neutral or negligible correlations, red marks positive correlations, and dark blue marks negative correlations.

The factors most correlated with attrition are salary and years at company. The interpretation is as follows:

Salary has a weak negative correlation with attrition (r = -0.21), which implies that employees with higher salaries are somewhat less likely to leave the company.

Years at company has a strong negative correlation with attrition (r = -0.69), which implies that the longer an employee has been with the company, the less likely they are to leave.

Job satisfaction, manager rating, and work-life balance have correlations with attrition that are close to zero (all below 0.02 in absolute value), so they show no meaningful linear relationship with attrition in this data.

These results suggest that employees with lower salaries and shorter tenure are the most likely to leave. Retention efforts should therefore pay particular attention to newer and lower-paid employees.
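One simple way to read the matrix is to take the bi_attrition row and rank the other variables by the absolute size of their correlation. The sketch below copies the values from the printed hr_corr output above into a named vector. On the real object, a line such as hr_corr["bi_attrition", -1] would extract the same row, assuming bi_attrition is the first column.

```r
# Correlations with bi_attrition, copied from the printed hr_corr output
attrition_cor <- c(
  salary            = -0.211181478,
  years_at_company  = -0.689652780,
  job_satisfaction  =  0.013236813,
  manager_rating    = -0.007654429,
  work_life_balance =  0.003428836
)

# Rank by strength of association, ignoring the sign
ranked <- attrition_cor[order(abs(attrition_cor), decreasing = TRUE)]
print(round(ranked, 2))
```

Ranking by absolute value puts years_at_company first and salary second, with the three rating variables close to zero.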

5.3 Predictive modeling for attrition

Task 5.3. Predictive modeling for attrition
  • Create a logistic regression model to predict employee attrition using the following variables: salary, years_at_company, job_satisfaction, manager_rating, and work_life_balance. Save the model as hr_attrition_glm_model. Print the summary of the model using the summary function.

  • Install the sjPlot package and use the tab_model function to display the summary of the model. You may read the sjPlot documentation on how to customize your model summary.

  • Also, use the plot_model function to visualize the model coefficients. You may read the sjPlot documentation on how to customize your model visualization.

  • Discuss the results of the logistic regression model and what they suggest about the factors that contribute to employee attrition.

## run a logistic regression model to predict employee attrition
## save the model as hr_attrition_glm_model

hr_attrition_glm_model <- glm(
  bi_attrition ~ salary + years_at_company + job_satisfaction + 
    manager_rating + work_life_balance,
  data = hr_perf_dta,
  family = binomial()
)

## print the summary of the model using the summary function

summary(hr_attrition_glm_model)

Call:
glm(formula = bi_attrition ~ salary + years_at_company + job_satisfaction + 
    manager_rating + work_life_balance, family = binomial(), 
    data = hr_perf_dta)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)        2.571e+00  2.173e-01  11.831   <2e-16 ***
salary            -3.633e-06  4.086e-07  -8.893   <2e-16 ***
years_at_company  -6.333e-01  1.476e-02 -42.919   <2e-16 ***
job_satisfaction   3.470e-02  3.186e-02   1.089    0.276    
manager_rating     5.071e-03  3.810e-02   0.133    0.894    
work_life_balance  2.587e-02  3.198e-02   0.809    0.419    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8574.5  on 6708  degrees of freedom
Residual deviance: 4781.6  on 6703  degrees of freedom
AIC: 4793.6

Number of Fisher Scoring iterations: 5
## install sjPlot package and use tab_model function to display the summary of the model

tab_model(hr_attrition_glm_model,
          title = "Factors Influencing Employee Attrition",
          show.aic = TRUE,
          show.ci = TRUE,
          string.pred = "Predictor",
          string.est = "Odds Ratio",
          dv.labels = "Employee Attrition")
Factors Influencing Employee Attrition

                    Employee Attrition
Predictor           Odds Ratio   CI           p
(Intercept)         13.08        0.00 – Inf   <0.001
salary              1.00         0.00 – Inf   <0.001
years at company    0.53         0.00 – Inf   <0.001
job satisfaction    1.04         0.00 – Inf   0.276
manager rating      1.01         0.00 – Inf   0.894
work life balance   1.03         0.00 – Inf   0.419

Observations        6709
R2 Tjur             0.502
AIC                 4793.596

## use plot_model function to visualize the model coefficients

plot_model(hr_attrition_glm_model,
           title = "Effects on Employee Attrition",
           show.values = TRUE,
           value.offset = .3,
           sort.est = TRUE,
           vline.color = "red",
           colors = "blue") +
  theme_minimal() +
  labs(x = "Odds Ratio",
       y = "Predictors")

Discussion:

The logistic regression assesses the effect of each key variable on the odds of employee attrition, holding the other variables constant. The results for each variable are as follows:

  1. Job Satisfaction has an odds ratio of 1.04 and is not statistically significant (p = 0.276), so the model provides no evidence that job satisfaction scores are associated with attrition.
  2. Work Life Balance has an odds ratio of 1.03 and is likewise not statistically significant (p = 0.419), so the model provides no evidence that work-life balance scores are associated with attrition.
  3. Manager Rating has the smallest estimated effect (odds ratio 1.01) and is not statistically significant (p = 0.894), so manager ratings do not help distinguish employees who leave from those who stay.
  4. Salary is highly significant (p < 0.001) and has a negative coefficient, so higher salaries are associated with lower odds of attrition. Its odds ratio rounds to 1.00 only because the coefficient (-3.633e-06) is expressed per single unit of salary; the effect accumulates over salary differences of thousands of units.
  5. Years at Company is also highly significant (p < 0.001) and exhibits the strongest effect among all factors. Each additional year is associated with roughly 47% lower odds of attrition (odds ratio 0.53), suggesting longer tenure is associated with lower attrition risk.
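Because a logistic regression coefficient is on the log-odds scale per one unit of the predictor, the salary odds ratio can be rescaled to a more interpretable step. The sketch below uses the coefficients from the summary output above. The step of 10,000 salary units is an illustrative choice, not something prescribed by the task.

```r
# Coefficients copied from summary(hr_attrition_glm_model)
coef_salary <- -3.633e-06   # log-odds change per 1 unit of salary
coef_tenure <- -6.333e-01   # log-odds change per 1 year at the company

# Odds ratio per single unit of salary (this is what rounds to 1.00)
or_salary_per_unit <- exp(coef_salary)

# Odds ratio for a 10,000-unit salary increase (illustrative step size)
or_salary_per_10k <- exp(coef_salary * 10000)

# Odds ratio per additional year of tenure
or_tenure_per_year <- exp(coef_tenure)

print(round(c(per_unit = or_salary_per_unit,
              per_10k  = or_salary_per_10k,
              per_year = or_tenure_per_year), 3))
```

Under this rescaling, a 10,000 increase in salary corresponds to an odds ratio of about 0.96, while one more year of tenure corresponds to about 0.53. With the fitted model object, exp(coef(hr_attrition_glm_model)["salary"] * 10000) would give the same rescaled value.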

5.4 Analysis of compensation and turnover

Task 5.4. Analyzing compensation and turnover
  • Compare the average monthly income of employees who left the company (bi_attrition = 1) and those who stayed (bi_attrition = 0). Use the t.test function to conduct a t-test and determine if there is a significant difference in average monthly income between the two groups. Save the results in a variable called attrition_ttest_results.

  • Install the report package and use the report function to generate a report of the t-test results.

  • Install the ggstatsplot package and use the ggbetweenstats function to visualize the distribution of monthly income for employees who left and those who stayed. Make sure to map the bi_attrition variable to the x argument and the salary variable to the y argument.

  • Visualize the salary variable for employees who left and those who stayed using geom_histogram with geom_freqpoly. Make sure to facet the plot by the bi_attrition variable and apply alpha on the histogram plot.

  • Provide recommendations on whether revising compensation policies could be an effective retention strategy.

## compare the average monthly income of employees who left and those who stayed

attrition_ttest_results <- t.test(
  salary ~ bi_attrition,
  data = hr_perf_dta)

## print the results of the t-test

print(attrition_ttest_results)

    Welch Two Sample t-test

data:  salary by bi_attrition
t = 19.074, df = 5557.5, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 39387.67 48411.52
sample estimates:
mean in group 0 mean in group 1 
      125856.35        81956.76 
## install the report package and use the report function to generate a report of the t-test results

report(attrition_ttest_results)
Effect sizes were labelled following Cohen's (1988) recommendations.

The Welch Two Sample t-test testing the difference of salary by bi_attrition
(mean in group 0 = 1.26e+05, mean in group 1 = 81956.76) suggests that the
effect is positive, statistically significant, and medium (difference =
43899.59, 95% CI [39387.67, 48411.52], t(5557.53) = 19.07, p < .001; Cohen's d
= 0.51, 95% CI [0.46, 0.57])
# install ggstatsplot package and use ggbetweenstats function to visualize the distribution of monthly income for employees who left and those who stayed

ggbetweenstats(
  data = hr_perf_dta,
  x = bi_attrition,
  y = salary,
  title = "Monthly Income Distribution by Attrition Status",
  subtitle = "Comparing Employees Who Left vs Stayed",
  type = "parametric", 
  messages = FALSE,
  xlab = "Employee Status",
  ylab = "Monthly Income",
  palette = "Set2",
  plot.type = "violin",
  results.subtitle = TRUE,
  bf.message = FALSE
) +
  scale_y_continuous(labels = label_comma()) +
  scale_x_discrete(labels = c("0" = "Stayed", "1" = "Left")) 

# create histogram and frequency polygon of salary for employees who left and those who stayed

ggplot(hr_perf_dta, 
       aes(x = salary, 
           fill = factor(bi_attrition, 
                         labels = c("Stayed", "Left")))) +
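  geom_histogram(alpha = 0.5, 
                 position = "identity", 
                 bins = 30,
                 aes(y = after_stat(density))) +  # after_stat() replaces the deprecated ..density.. notation
  geom_freqpoly(aes(color = factor(bi_attrition, 
                                   labels = c("Stayed", "Left")),
                    y = after_stat(density)), 
                linewidth = 1,  # linewidth replaces size for lines in ggplot2 3.4.0 and later
                bins = 30) +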
  facet_wrap(~bi_attrition, 
             labeller = labeller(bi_attrition = 
                                   c("0" = "Stayed", "1" = "Left"))) +
  scale_x_continuous(labels = label_comma()) +
  scale_y_continuous(labels = label_comma()) +
  scale_fill_brewer(palette = "Set2") +
  scale_color_brewer(palette = "Set2") +
  labs(title = "Salary Distribution Comparison",
       subtitle = "Density Distribution of Monthly Income by Attrition Status",
       x = "Monthly Income ($)",
       y = "Density",
       fill = "Employment Status",
       color = "Employment Status") +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold"),
    axis.title = element_text(face = "bold"),
    strip.text = element_text(face = "bold")
  )

# Salary Summary in DataTable (EXTRA CONTENT)

salary_summary <- hr_perf_dta |>
  group_by(bi_attrition) |>
  summarise(
    mean_salary = mean(salary),
    median_salary = median(salary),
    sd_salary = sd(salary),
    q1_salary = quantile(salary, 0.25),
    q3_salary = quantile(salary, 0.75),
    n = n()
  ) |> 
  mutate(group = if_else(bi_attrition == 0, "Stayed", "Left")) 

datatable(salary_summary)
Discussion:

Based on the Salary Distribution Comparison chart and the summary data table, the mean salary is around 125,856 for those who stayed and around 81,957 for those who left. Employees who stayed therefore earn about 54% more on average than those who left, a difference of roughly 43,900. Additionally, both groups show right-skewed distributions, and the “stayed” group has a wider spread, with a standard deviation of approximately 102,749 versus approximately 81,304 for the “left” group.

So, it appears that higher-paid employees are more likely to stay and lower-paid employees are more likely to leave.
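One possible way to back this observation with a formal test is sketched below. Because both salary distributions are right-skewed, the sketch uses a Wilcoxon rank-sum test rather than a t-test. This test is not part of the original analysis, and the data frame `toy` is synthetic and only borrows the column names `salary` and `bi_attrition`; in practice `hr_perf_dta` would take its place.

```r
# Synthetic, right-skewed salaries for illustration only (not the HR data)
set.seed(147)
toy <- data.frame(
  bi_attrition = rep(c(0, 1), times = c(200, 80)),
  salary = c(rlnorm(200, meanlog = 11.5, sdlog = 0.6),   # "stayed"
             rlnorm(80,  meanlog = 11.1, sdlog = 0.6))   # "left"
)

# Rank-based comparison of salary between the two attrition groups
salary_test <- wilcox.test(salary ~ bi_attrition, data = toy)
salary_test$p.value
```

A small p-value would indicate that the salary distributions of the two groups differ, which would support the visual impression from the chart without relying on the means alone.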

Therefore, revising compensation policies to implement targeted salary adjustments for current employees (the “stayed” group) whose compensation falls below the market rate would help mitigate the risk of attrition among lower-paid employees and improve overall job satisfaction.

5.5 Employee satisfaction and performance analysis

Task 5.5. Analyzing employee satisfaction and performance
  • Analyze the average performance ratings (both ManagerRating and SelfRating) of employees who left vs. those who stayed. Use the group_by and summarise functions to calculate the average performance ratings for each group.

  • Visualize the distribution of SelfRating for employees who left and those who stayed using a bar plot. Use the ggplot function to create the plot and map the SelfRating variable to the x argument and the bi_attrition variable to the fill argument.

  • Similarly, visualize the distribution of ManagerRating for employees who left and those who stayed using a bar plot. Make sure to map the ManagerRating variable to the x argument and the bi_attrition variable to the fill argument.

  • Create a boxplot of salary by job_satisfaction and bi_attrition to analyze the relationship between salary, job satisfaction, and attrition. Use the geom_boxplot function to create the plot and map the salary variable to the x argument, the job_satisfaction variable to the y argument, and the bi_attrition variable to the fill argument. You need to transform the job_satisfaction and bi_attrition variables into factors before creating the plot or within the ggplot function.

  • Discuss the results of the analysis and provide recommendations for HR interventions based on the findings.

# Analyze the average performance ratings (both ManagerRating and SelfRating) of employees who left vs. those who stayed.

performance_by_attrition <- hr_perf_dta |>
  group_by(bi_attrition) |>
  summarise(
    avg_self_rating = mean(self_rating, na.rm = TRUE),
    avg_manager_rating = mean(manager_rating, na.rm = TRUE),
    count = n())
# Visualize the distribution of SelfRating for employees who left and those who stayed using a bar plot.

ggplot(hr_perf_dta, aes(x = self_rating, fill = factor(bi_attrition))) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c("0" = "#69b3a2", "1" = "#e15759"),
                    labels = c("Stayed", "Left")) +
  labs(title = "Distribution of Self Ratings by Attrition Status",
       x = "Self Rating",
       y = "Count",
       fill = "Attrition Status") +
  theme_minimal()

# Visualize the distribution of ManagerRating for employees who left and those who stayed using a bar plot.

ggplot(hr_perf_dta, aes(x = manager_rating, fill = factor(bi_attrition))) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c("0" = "#69b3a2", "1" = "#e15759"),
                    labels = c("Stayed", "Left")) +
  labs(title = "Distribution of Manager Ratings by Attrition Status",
       x = "Manager Rating",
       y = "Count",
       fill = "Attrition Status") +
  theme_minimal()

# Create a boxplot of salary by job_satisfaction and bi_attrition to analyze the relationship between salary, job satisfaction, and attrition.

ggplot(hr_perf_dta, aes(x = factor(job_satisfaction), 
                        y = salary, 
                        fill = factor(bi_attrition))) +
  geom_boxplot() +
  scale_fill_manual(values = c("0" = "#69b3a2", "1" = "#e15759"),
                    labels = c("Stayed", "Left")) +
  labs(title = "Salary Distribution by Job Satisfaction and Attrition",
       x = "Job Satisfaction Level",
       y = "Salary",
       fill = "Attrition Status") +
  theme_minimal() +
  scale_y_continuous(labels = label_comma()) 

datatable(hr_perf_dta)
hr_perf_dta |> 
  filter(is.na(job_satisfaction)) |>
  select(job_satisfaction, salary, bi_attrition)
# A tibble: 0 × 3
# ℹ 3 variables: job_satisfaction <dbl>, salary <dbl>, bi_attrition <dbl>
Discussion:

The box plot describes the Salary Distribution by Job Satisfaction and Attrition.

At Level 1 Job Satisfaction, we can observe that the salary range is relatively narrow compared to other levels, and the median salary for those who stayed is slightly higher than for those who left.

At Level 2 Job Satisfaction, we can observe that there is a wider salary range compared to Level 1. There is also a clearer difference in median salary between those who stayed and those who left.

At Level 3 Job Satisfaction, we can observe that there is a broader salary range with significant overlap between those who stayed and those who left. The median salary for those who stayed is also noticeably higher than for those who left.

At Level 4 Job Satisfaction, we can observe that there is a very wide salary range compared to the previous levels, especially for those who stayed, and that there is a clear difference in median salary between stayers and leavers.

At Level 5 Job Satisfaction, we can observe that this level exhibits the widest salary range of all levels and the highest median salaries for both those who stayed and those who left. Yet, some employees still left despite high satisfaction and high salaries.

Based on the salary distribution analysis across job satisfaction levels and attrition status, regular salary reviews are important to ensure both internal equity and external competitiveness, particularly for employees at lower satisfaction levels. Most importantly, manager training is critical, because some employees still quit despite reporting high satisfaction with their jobs. Equipping managers with the skills to address the satisfaction issues specific to each level will enable them to provide tailored support and interventions, potentially heading off attrition before it occurs.
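The level-by-level comparison of medians above is read off the box plot. A minimal sketch of how the same comparison could be tabulated is shown below. It uses a tiny made-up data frame `toy` with the column names `job_satisfaction`, `bi_attrition`, and `salary`; the values are invented for illustration and are not taken from the HR data.

```r
# Made-up values for illustration only (two satisfaction levels)
toy <- data.frame(
  job_satisfaction = c(1, 1, 1, 1, 2, 2, 2, 2),
  bi_attrition     = c(0, 0, 1, 1, 0, 0, 1, 1),
  salary           = c(50000, 70000, 40000, 60000,
                       90000, 110000, 60000, 80000)
)

# Median salary for each satisfaction level (rows) and attrition status (columns)
med <- tapply(toy$salary,
              list(satisfaction = toy$job_satisfaction,
                   attrition    = toy$bi_attrition),
              median)

# Median gap between those who stayed ("0") and those who left ("1")
gap <- med[, "0"] - med[, "1"]
gap
```

Applied to the full data set, a table like `med` makes it easier to verify statements such as "the median for those who stayed is noticeably higher" with actual numbers rather than by eye.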

5.6 Work-life balance and retention strategies

Task 5.6. Analyzing work-life balance and retention strategies

At this point, you are already well aware of the dataset and the possible factors that contribute to employee attrition. Using your R skills, accomplish the following tasks:

  • Analyze the distribution of WorkLifeBalance ratings for employees who left versus those who stayed.

  • Use visualizations to show the differences.

  • Assess whether employees with poor work-life balance are more likely to leave.

You are free to choose how you will accomplish this task. Be creative and provide insights that will help HR develop effective retention strategies.

# task 5.6 stacked bar plot

wlb_summary <- hr_perf_dta |>
  group_by(bi_attrition, cat_work_life_balance) |>
  summarise(
    count = n(),
    # Share of all employees; position = "fill" below rescales within each rating
    pct = n() / nrow(hr_perf_dta) * 100,
    .groups = "drop"
  )

ggplot(wlb_summary, aes(x = cat_work_life_balance, y = pct, fill = factor(bi_attrition))) +
  geom_bar(stat = "identity", position = "fill") +
  scale_fill_manual(values = c("0" = "#69b3a2", "1" = "#e15759"),
                    labels = c("Stayed", "Left")) +
  labs(title = "Work-Life Balance Distribution by Attrition Status",
       subtitle = "Showing proportion of employees who left vs stayed",
       x = "Work-Life Balance Rating",
       y = "Proportion",
       fill = "Employment Status") +
  theme_minimal() +
  scale_x_discrete(labels = c("Unacceptable", "Needs Improvement", 
                              "Meets Expectation", "Exceeds Expectation", 
                              "Above and Beyond")) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1))

Discussion:

The Work-Life Balance Distribution by Attrition Status chart provides valuable insights into the relationship between employees’ perceived work-life balance and their likelihood of leaving the company. The data reveals a clear trend: as work-life balance ratings improve, employee retention rates increase.

Employees rating their work-life balance as “Unacceptable” or “Needs Improvement” show the highest attrition rates. This suggests that poor work-life balance is a strong predictor of employee turnover. Additionally, at the “Meets Expectation” level, attrition is more or less similar to the “Unacceptable” level, which suggests that meeting basic work-life balance expectations may be a crucial threshold for retention.

At the higher levels, the difference in attrition between the “Meets Expectation” and “Exceeds Expectation” categories is minimal, implying that satisfactory work-life balance might be sufficient for many employees. The “Above and Beyond” category shows the lowest attrition rate at around 30%, demonstrating that exceptional work-life balance correlates with the highest retention. However, it is important to note that even in the best-rated category, a significant proportion of employees still leave.

These findings suggest that focusing on improving work-life balance, particularly for employees in the lower-rated categories, could yield substantial benefits in reducing attrition. The HR team should look into initiatives that can help shift employees from “Unacceptable” and “Needs Improvement” categories to at least “Meets Expectation” to see the most significant impact on retention rates.
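The attrition rates discussed above are read from the proportional bar chart. The sketch below shows one way the rate within each work-life balance category could be computed directly, followed by a chi-squared test of independence as a rough check on whether attrition differs across categories. The test is an addition for illustration and not part of the original analysis, and `toy` is invented data that only reuses the column names `cat_work_life_balance` and `bi_attrition` with three of the rating labels.

```r
# Invented counts for illustration only: 20 employees per rating
ratings <- c("Unacceptable", "Needs Improvement", "Meets Expectation")
toy <- data.frame(
  cat_work_life_balance = factor(rep(ratings, each = 20), levels = ratings),
  bi_attrition = c(rep(c(1, 0), times = c(12, 8)),   # Unacceptable
                   rep(c(1, 0), times = c(10, 10)),  # Needs Improvement
                   rep(c(1, 0), times = c(6, 14)))   # Meets Expectation
)

# Cross-tabulate rating (rows) against attrition status (columns)
counts <- table(toy$cat_work_life_balance, toy$bi_attrition)

# Attrition rate within each rating: share of rows with bi_attrition == 1
rates <- prop.table(counts, margin = 1)[, "1"]
rates

# Test whether attrition status is independent of the rating
wlb_test <- chisq.test(counts)
wlb_test$p.value
```

Reporting `rates` alongside the chart gives exact within-category percentages, which is what the "around 30%" style of statement needs to be checked against.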

5.7 Recommendations for HR interventions

Task 5.7. Recommendations for HR interventions

Based on the analysis conducted, provide recommendations for HR interventions that could help reduce employee attrition and improve overall employee satisfaction and performance. You may use the following questions as a guide for your recommendations and discussions.

  • What are the key factors contributing to employee attrition in the company?

  • Which factors are most strongly correlated with attrition?

  • What strategies could be implemented to improve employee retention and satisfaction?

  • How can HR leverage the insights from the analysis to develop effective retention strategies?

  • What are the potential benefits of implementing these strategies for the company?

Discussion:

I urge the HR team to look into the correlation analysis of the key drivers of employee attrition, as it identifies the significant variables that affect attrition. There are a multitude of interrelations between attrition and the other key drivers. However, it is important to prioritize the primary drivers, which are employee salary and years at company, and afterwards look into the employee-related factors such as job satisfaction, manager rating, and work-life balance.
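As a reminder of what such a correlation ranking looks like, the sketch below correlates a binary attrition flag with a few candidate drivers and sorts them by absolute strength. Everything in it is synthetic: the data frame `toy`, the column name `years_at_company`, and the way the outcome is generated are assumptions made for illustration, so the resulting numbers say nothing about the real HR data.

```r
# Synthetic data for illustration only (not the HR data)
set.seed(147)
n <- 300
toy <- data.frame(
  salary           = rlnorm(n, meanlog = 11.3, sdlog = 0.6),
  years_at_company = sample(0:10, n, replace = TRUE),  # hypothetical column name
  job_satisfaction = sample(1:5, n, replace = TRUE)
)

# Synthetic outcome: leaving is made less likely with tenure and salary
p_leave <- plogis(1 - 0.3 * toy$years_at_company - 0.00001 * toy$salary)
toy$bi_attrition <- rbinom(n, size = 1, prob = p_leave)

# Correlation of each candidate driver with the attrition flag
drivers <- c("salary", "years_at_company", "job_satisfaction")
cors <- sapply(toy[drivers], function(x) cor(x, toy$bi_attrition))

# Rank drivers by the absolute size of their correlation
cors[order(abs(cors), decreasing = TRUE)]
```

Ranking by absolute correlation is a simple way to decide which drivers to examine first, though correlation alone does not show that a driver causes attrition.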

Looking into the primary drivers (employee salary, years at company)

The HR team could conduct a comprehensive salary review to make sure that employees are not dissatisfied with their compensation. It may also be beneficial to offer a performance-based bonus to increase employees’ motivation.

They could also implement targeted retention strategies for new employees by assigning them mentors to help them integrate into the company more comfortably. For long-term employees, HR could offer paid specialized training programs to widen and improve their skill sets. This also benefits the company, since employees who have more to offer create added opportunities.

Most importantly, we cannot forget the employee-related factors such as job satisfaction, manager rating, and work-life balance. Perhaps the HR team could provide wellness programs and stress management resources. It is also important to build rapport within the company, so there should be initiatives that serve as an avenue for employees to interact with each other and enjoy themselves.

For instance, during my previous internship at RAFI MFI, we had the opportunity to attend their “TGIF,” where, on the last Friday of every month, work hours are strictly cut short so that everyone can gather in the afternoon to eat and drink together. From my observation, this made their work environment light, not too serious, and overall quite fun.

Aside from that, the HR team could also conduct regular employee surveys to identify areas of dissatisfaction so that they can be addressed immediately.

The multiple analyses conducted in this project allow the HR team to gather relevant insights. One action they could take is to invest in a data dashboard, perhaps through Power BI. Such a dashboard would make it easy to monitor developments in the variables related to employee attrition and to deliver them quickly to the relevant personnel, who can then make data-driven decisions based on the insights this project provides.

This, in turn, can lower attrition rates, which reduces recruitment and training costs. Other potential benefits include improved employee productivity and engagement, an enhanced company reputation as an employer of choice, better continuity in projects and client relationships, increased innovation and creativity from a more satisfied workforce, and improved overall company performance and profitability.